Generalized Additive Models in Fraud Detection
Data Science Capstone Project
Grace Allen, Kesi Allen, Sonya Melton, Pingping Zhou
2025-11-21
Introduction
What are generalized additive models?
Not your typical straight-line regression — GAMs let patterns curve naturally
Great at uncovering hidden trends in messy real-world data
Each feature gets its own shape, showing where risk rises or falls
Makes the model’s behavior easy to explain to non-technical teams
Perfect for fraud detection, where small pattern changes matter
Brief History of GAMs
Generalized Additive Models were introduced in the late 1980s as a way to add flexibility to traditional regression models. Trevor Hastie and Robert Tibshirani developed the framework to allow each predictor in a model to follow its own smooth pattern rather than forcing everything into a straight line. Through the 1990s and early 2000s, the approach grew in popularity in fields that needed interpretable models, including public health, ecology, and social sciences.
Brief History of GAMs
A major step forward came with the development of the mgcv package in R, created by Simon Wood. His work added modern smoothing techniques, automatic penalty selection, and faster computation, making GAMs practical for large and noisy datasets. Today, GAMs are widely used in finance, fraud detection, risk scoring, and other areas where organizations need both predictive accuracy and clear explanations.
GAMS in Action: Real World Uses + Our Study
GAMs help uncover nonlinear relationships and subtle patterns across diverse domains:
Our Project: Study Context: GAMs for Fraud Detection
Toolset: RStudio + the mgcv package
Dataset: Kaggle’s Fraud Detection Transactions (Ashar, 2024)
Purpose: Identify predictive variables linked to fraudulent activity
Context: Synthetic but realistic data for controlled testing
Here’s how we used GAMs to explore patterns in the fraud dataset.
Methods
GAM Modeling Overview
GAMs extend traditional regression
Capture nonlinear predictor-response relationships
Use spline-based smooth functions
Combine continuous + categorical predictors
Fit with mgcv (penalized splines + GCV)
Model outputs interpretable smooth effects
Goal: Estimate probability of fraud
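The fitting step above can be sketched in mgcv. The data below are simulated, and the predictor name `risk_score` is illustrative rather than a column from the project dataset:

```r
# Minimal binomial GAM sketch with mgcv (simulated stand-in data)
library(mgcv)

set.seed(42)
n <- 2000
risk_score <- runif(n)                        # hypothetical continuous predictor
p_true     <- plogis(-3 + 4 * risk_score^2)   # nonlinear fraud probability
fraud      <- rbinom(n, 1, p_true)

# Penalized spline smooth; smoothness selected automatically
# (REML here; GCV is the alternative mentioned in the slides)
fit <- gam(fraud ~ s(risk_score),
           family = binomial(link = "logit"),
           method = "REML")

summary(fit)   # interpretable smooth effect: edf and significance
plot(fit)      # the estimated risk curve on the link (log-odds) scale
```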
Modeling Workflow Steps
1. Data Acquisition
2. Data Exploration & Cleaning
3. Categorical Summary
4. Visualizations
5. Assumptions
6. GAM Analysis
7. GAM Model for predictors
8. Model Performance
9. Final Interpretation
GAM Equation
\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]
g = link function (logit for the binary fraud outcome)
Smooth functions capture nonlinear effects
Additive contributions from each predictor
Balances flexibility + interpretability
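Spelling out the logit link for the binary fraud outcome, using the same symbols as the equation above:

\[ \text{logit}(\mu) = \log\frac{\mu}{1-\mu}, \qquad \Pr(\text{Fraud} = 1 \mid X) = \frac{1}{1 + \exp\left(-\alpha - \sum_{j=1}^{p} s_j(X_j)\right)} \]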
GAM Assumptions (Fraud Context)
Logit link approximates fraud probability
Additive and independent predictor effects
Smooth, gradual functional relationships
Binomial response distribution
Independent observations
Low predictor multicollinearity
Penalization prevents overfitting
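Several of these assumptions can be checked directly. A sketch using mgcv's `gam.check()` (basis size and residual diagnostics) and `concurvity()` (the GAM analogue of multicollinearity), on simulated data with illustrative variable names:

```r
# Assumption checks for a fitted binomial GAM (simulated data)
library(mgcv)

set.seed(1)
n  <- 1500
x1 <- runif(n)
x2 <- runif(n)
y  <- rbinom(n, 1, plogis(-2 + 3 * x1 - 2 * x2^2))

fit <- gam(y ~ s(x1) + s(x2), family = binomial, method = "REML")

gam.check(fit)    # basis-dimension (k) diagnostics and residual plots
concurvity(fit)   # values near 1 warn of overlapping smooth effects
```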
Why We Chose GAMs For Fraud Detection
Captures nonlinear fraud patterns
Handles rare, imbalanced outcomes
Produces interpretable smooth risk curves
Supports regulatory transparency
Balances accuracy + interpretability
Strong literature support for fraud analytics
Scalable through mgcv’s automated smoothing
Practical Advantages & Relevance to Real-World Analytics
Supports investigative decision-making
Shows monotonic or nonlinear risk curves
Can benchmark or surrogate black-box models
High recall for suspicious transactions
Useful for auditors, fraud teams, analysts
Aligns with both operational and compliance needs
Analysis and Results
Data Exploration and Visualization
Dataset Description
What It Is
Analysis and Results
Data Exploration and Visualization
Why We Use It
Analysis and Results
Data Exploration and Visualization
What Makes It Special
Realistic fraud patterns:
Groups of fraudulent transactions
Subtle, hard‑to‑notice anomalies
Odd user behaviors
Large & diverse records: balances normal vs. rare fraud cases → addresses class imbalance.
Data Exploration and Visualization
Key Characteristics
What’s Inside
50,000 Rows: a substantial sample for fitting and evaluating flexible smooths.
Two Labels: every transaction is marked as either 1 = Fraud or 0 = Not Fraud.
Data Exploration and Visualization
Data Features: 21 features across three categories:
Numbers: transaction amounts, risk scores, account balances.
Categories: Transaction types (payment, transfer, withdrawal), device types, merchant categories.
Time Data: When transactions happened (time, day) and their sequence.
Data Exploration and Visualization
Label Distribution
Class Imbalance: fraudulent transactions are a small percentage, reflecting real-world scenarios.
Behavioral Realism: Includes unusual spending, behavioral signals, and high-risk profiles.
Modeling flexibility: supports interpretable (GAMs, logistic regression) or high-performance (XGBoost) approaches
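Class imbalance can be quantified before any modeling. A sketch with simulated labels — the ~5% fraud rate is an assumption for illustration; in the project this would be the dataset's fraud label column:

```r
# Quantify class imbalance (simulated labels standing in for the real column)
set.seed(7)
labels <- rbinom(50000, 1, 0.05)   # assumed ~5% fraud rate, illustrative only

counts <- table(labels)            # raw counts per class
props  <- prop.table(counts)       # class proportions

print(counts)
print(round(props, 3))
```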
Distribution of Variables
| Type           | Count  |
|----------------|--------|
| POS            | 12,549 |
| Online         | 12,546 |
| ATM Withdrawal | 12,453 |
| Bank Transfer  | 12,452 |

| Device | Count  |
|--------|--------|
| Tablet | 16,779 |
| Mobile | 16,640 |
| Laptop | 16,581 |

| Merchant_Category | Count  |
|-------------------|--------|
| Clothing          | 10,033 |
| Groceries         | 10,019 |
| Travel            | 10,015 |
| Restaurants       | 9,976  |
| Electronics       | 9,957  |
Distribution of Variables
Non-linearity Check
Modeling and Results
Assumptions
GAM Analysis for Numeric Variables
GAM Analysis for Categorical Variables
GAM Model for Key Predictor
GAM Equation for Key Predictor
GAM equation structure:
\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]
Our model simplifies to a single predictor:
\[ \text{logit}(\Pr(\text{Fraud} = 1)) = \alpha + s(\text{Risk\_Score}) \]
where \(\alpha = 1.9109\) is the intercept. Because mgcv applies a sum-to-zero constraint to each smooth, \(\alpha\) represents the baseline log-odds of fraud when the smooth term \(s(\text{Risk\_Score})\) contributes zero (roughly an average Risk_Score), rather than literally at Risk_Score = 0.
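A runnable sketch of this single-predictor model: the data are simulated, and only the variable names (Risk_Score, Fraud_Label) and the reported intercept value 1.9109 come from the slides:

```r
# Single-predictor binomial GAM sketch (simulated stand-in data)
library(mgcv)

set.seed(99)
n <- 5000
Risk_Score  <- runif(n, 0, 100)
Fraud_Label <- rbinom(n, 1, plogis(-4 + 0.06 * Risk_Score))

fit <- gam(Fraud_Label ~ s(Risk_Score), family = binomial, method = "REML")
coef(fit)[1]    # intercept: baseline log-odds (mgcv centers the smooth, so this
                # is the log-odds where s(Risk_Score) contributes zero)

# Converting the slides' reported intercept to a probability via the inverse logit:
plogis(1.9109)  # about 0.871
```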